April 21, 2020

Plan for this Week

First, some quick tips

Useful quick tips

R Comments

Write comments in your code after # (in Rmd documents, # marks a comment only inside code chunks; outside a chunk it starts a heading)

my.vec1 <- c("some","word") # this is a comment
my.vec2 <- c("some","other","word") # this is also a comment

Saving R objects

save(list = c("my.vec1", "my.vec2"), file = "MyCharVecs.RData")

Loading objects into R

load("MyCharVecs.RData")
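A minimal round-trip sketch of the two calls above. It writes to a temporary directory (an assumption made here so the example does not touch your project folder) and checks that the objects come back after being removed.

```r
# Create two character vectors, as in the tips above
my.vec1 <- c("some", "word")
my.vec2 <- c("some", "other", "word")

# Save both objects into one .RData file (temporary location for this sketch)
rdata.path <- file.path(tempdir(), "MyCharVecs.RData")
save(list = c("my.vec1", "my.vec2"), file = rdata.path)

# Remove them from the workspace, then restore them from disk
rm(my.vec1, my.vec2)
restored <- load(rdata.path)  # load() returns the names of the restored objects

print(restored)
print(my.vec2)
```

Note that load() puts the objects back under their original names, so it can silently overwrite objects of the same name in your session.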

Where are these files being saved to and loaded from?

Working Directory

R saves and looks for files in your current working directory. To see what it is, use:

getwd()
## [1] "/cloud/project"

You can also set your session to a working directory

setwd("C:/Users/dtr/theDirectory")

Working dirs in R Markdown docs are set automatically to where the Rmd file is stored
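A short sketch of how relative file names resolve against the working directory. The folder name wd_demo is hypothetical and lives under tempdir() so nothing in your project is changed.

```r
old.wd <- getwd()                              # remember where we started

# Hypothetical folder used only for this sketch
demo.dir <- file.path(tempdir(), "wd_demo")
dir.create(demo.dir, showWarnings = FALSE)

setwd(demo.dir)                                # relative paths now resolve inside demo.dir
writeLines("hello", "note.txt")                # written into the working directory
found.in.demo <- file.exists(file.path(demo.dir, "note.txt"))

setwd(old.wd)                                  # restore the original working directory
print(found.in.demo)
```

Restoring the original directory at the end keeps the rest of the session unaffected.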

Managing Files

Give each project (e.g., a homework) its own folder. Here is my system:

  • Every class or project has its own folder

  • Each assignment or task has a folder inside that, which is the working directory for that item.

  • .Rmd and .R files are named clearly and completely

For example, this presentation is located and named this:

Lectures/Lectures_Week04/DataVisualization02.Rmd

Use whatever system you want, but be consistent!

Grammar of Graphics

The Grammar of Graphics

  • Visualisation concept created by Wilkinson (1999)
    • defines basic elements of a statistical graphic
  • Adapted for R by Wickham (2009)
    • consistent and compact syntax to describe statistical graphics
    • highly modular as it breaks up graphs into semantic components
  • It’s not a guide to which graph to use and how to best convey your data

Terminology

A statistical graphic is a…

  • mapping of data
  • which may be statistically transformed (summarised, log-transformed, etc.)
  • to aesthetic attributes (color, size, xy-position, etc.)
  • using geometric objects (points, lines, bars, etc.)
  • and mapped onto a specific facet and coordinate system

Plotting in ggplot2

ggplot2

  • ggplot2 is based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts

  • It takes care of many of the fiddly details that make plotting a hassle, like drawing legends or faceting (in base graphics: legend, mfrow, mfcol, layout)

  • Powerful model for graphical representation of data, simplifies making complex multi-layered visualizations

Anatomy of a ggplot

The ggplot2 package builds plots in layers

  • data: ggplot

  • geometry: geom_point, geom_line, geom_smooth, geom_bar, …

  • titles and axis labels: ggtitle, labs, xlab, ylab

  • themes: theme, theme_bw, theme_classic, …

  • facets: facet_wrap and facet_grid

Layers are separated by a + sign.

Anatomy of a ggplot

ggplot2 defines aesthetics within each layer

Aesthetics control the appearance of the layers (e.g., point/line colors, or transparency via alpha, between 0 and 1)

  • x, y: \(x\) and \(y\) coordinate values to use
  • color: set color of elements based on some data value
  • group: describe which points are conceptually grouped together for the plot (often used with lines)
  • size: set size of points/lines based on some data value
  • alpha: set transparency based on some data value

Anatomy of a ggplot

Aesthetics: mappings vs settings

Aesthetic Settings:

  • These don’t depend on the data and can be specified directly on the layers

  • Some are: color, size, linetype, shape, fill, and alpha

  • See the ggplot2 documentation

Aesthetic Mappings:

  • Arguments inside aes() that depend on the data, e.g. geom_point(aes(color = continent))
  • aes() in the ggplot() layer gives overall aesthetics to use in other layers
  • aes() can be changed on individual layers
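The difference between a mapping and a setting can be sketched with a small made-up data frame (the values and the column names year, value and grp are hypothetical). layer_data() is used only to inspect the colours ggplot2 computed, without drawing anything.

```r
library(ggplot2)

# Small made-up data set (hypothetical values, for illustration only)
toy <- data.frame(
  year  = rep(c(2000, 2005, 2010), times = 2),
  value = c(1, 2, 3, 2, 4, 6),
  grp   = rep(c("a", "b"), each = 3)
)

# Mapping: colour depends on the data, so each group gets its own colour
p.mapped <- ggplot(toy, aes(x = year, y = value, color = grp)) +
  geom_line()

# Setting: a constant given outside aes() overrides the mapping for that layer
p.set <- ggplot(toy, aes(x = year, y = value, color = grp)) +
  geom_line(color = "red")

# layer_data() returns the values ggplot2 computed for a layer
mapped.colours <- unique(layer_data(p.mapped)$colour)
set.colours    <- unique(layer_data(p.set)$colour)
print(mapped.colours)
print(set.colours)
```

In the second plot every line is red even though color = grp appears in aes(), because a setting on the layer takes precedence over the inherited mapping.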

Now, let’s build a ggplot by steps

Making a ggplot

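library(ggplot2)    # plotting functions used throughout
library(gapminder)  # provides the gapminder data set
data(gapminder)
China <- gapminder[gapminder$country == "China",]
head(China, 4)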
## # A tibble: 4 x 6
##   country continent  year lifeExp       pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>     <int>     <dbl>
## 1 China   Asia       1952    44   556263527      400.
## 2 China   Asia       1957    50.5 637408000      576.
## 3 China   Asia       1962    44.5 665770000      488.
## 4 China   Asia       1967    58.4 754550000      613.

Making a ggplot: the base plot

… the data and global aesthetics

ggplot(data = China, 
       aes(x = year, y = lifeExp))

Making a ggplot: the geometry

… including a geom (a scatterplot)

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
    geom_point()

Making a ggplot: some aesthetics

… adding color and changing the size

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
    geom_point(color = "red", size = 3)

Making a ggplot: axis labels

… adding the x-label

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  xlab("Year")

Making a ggplot: axis labels

… adding the y-label

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  xlab("Year") + ylab("Life Expectancy")

Making a ggplot: title

… adding the title

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy in China")

Making a ggplot: theme

… choosing a theme

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy in China")  +
  theme_bw()

Making a ggplot: theme

… changing the text size

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy in China")  +
  theme_bw(base_size = 14)

Making a ggplot

… what if we want to see all countries?

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 14)

Oooops, this did not work!!!

Can’t separate countries.

Making a ggplot

… maybe use lines instead of points?

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp)) +
  geom_line(color = "red") +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 14)

ggplot can’t tell the countries apart; we need to tell it how!

Making a ggplot: group aesthetic

… tell ggplot to group by country

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country)) +
  geom_line(color = "red") +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 14)
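What the group aesthetic changes can be checked on a tiny made-up data frame (two hypothetical countries "A" and "B", with invented values). This sketch compares the number of groups ggplot2 sees with and without group = country.

```r
library(ggplot2)

# Two hypothetical countries with three observations each (made-up values)
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1952, 1957, 1962), times = 2),
  lifeExp = c(40, 42, 45, 60, 63, 66)
)

# Without group: all six points are joined into a single zig-zag line
p.ungrouped <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_line()

# With group: one line per country
p.grouped <- ggplot(toy, aes(x = year, y = lifeExp, group = country)) +
  geom_line()

# Count the distinct groups ggplot2 assigned in each case
n.ungrouped <- length(unique(layer_data(p.ungrouped)$group))
n.grouped   <- length(unique(layer_data(p.grouped)$group))
print(c(ungrouped = n.ungrouped, grouped = n.grouped))
```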

still hard to see patterns…

  • Let’s also make the lines narrower

  • Are there patterns by continent?

Making a ggplot: color aesthetic

… tell ggplot to color by continent

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country,
           colour = continent)) +
  geom_line(color = "red",
            lwd = 0.3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 14)

color didn’t change, why?

Making a ggplot: color aesthetic

… tell ggplot to color by continent

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country,
           color = continent)) +
  geom_line(lwd = 0.3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 8)

ok better, but crammed

  • There are clear patterns by continent, but hard to see here
  • Let’s get separate figures by continent

Making a ggplot: facets

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line(lwd = 0.3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 8) + 
  facet_wrap(continent~.)

Making a ggplot: legend options

… change legend position

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line(lwd = 0.3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 8) + 
  facet_wrap(~ continent) +
  theme(legend.position = c(0.8, 0.25))

Making a ggplot: legend options

… changed legend position, but don’t really need it

Making a ggplot: legend options

… remove legend

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line(lwd = 0.3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 8) + 
  facet_wrap(~ continent) +
  theme(legend.position = "none")

Making a ggplot: legend options

legend.position = "none"

Making a ggplot: more on faceting

… alternative faceting

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line(lwd = 0.3) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 8) + 
  facet_grid(cols = vars(continent)) +
  theme(legend.position = "none")

Making a ggplot: more on faceting

… alternative faceting

facet_grid by cols
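A small sketch comparing the two faceting functions on made-up data (the continent labels "X", "Y", "Z" and the values are hypothetical). Both produce one panel per level; facet_wrap() wraps the panels into rows and columns, while facet_grid(cols = ...) lays them all out in a single row.

```r
library(ggplot2)

# Three hypothetical groups with four observations each (made-up values)
toy <- data.frame(
  continent = rep(c("X", "Y", "Z"), each = 4),
  year      = rep(c(1990, 2000, 2010, 2020), times = 3),
  value     = c(1, 2, 3, 4, 2, 3, 5, 6, 4, 4, 5, 7)
)

base.plot <- ggplot(toy, aes(x = year, y = value)) +
  geom_line()

p.wrap <- base.plot + facet_wrap(~ continent)              # wrapped layout
p.grid <- base.plot + facet_grid(cols = vars(continent))   # one column per level

# Each row of the built data records the panel it is drawn in
n.panels.wrap <- length(unique(layer_data(p.wrap)$PANEL))
n.panels.grid <- length(unique(layer_data(p.grid)$PANEL))
print(c(wrap = n.panels.wrap, grid = n.panels.grid))
```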

Making a ggplot: adding a smooth

… get averages by continent

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line(lwd=0.1, alpha=0.5) +
  geom_line(stat = "smooth", method = "loess",
            aes(group = continent, color = continent)) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 8) + 
  facet_grid(cols = vars(continent)) +
  theme(legend.position = "none")

the alpha argument sets the transparency (0 = fully transparent, 1 = opaque)

Making a ggplot: adding a smooth

… get averages by continent

## `geom_smooth()` using formula 'y ~ x'

Making a ggplot: stored ggplots

… assign a ggplot object to a name

my.first.plot <- ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country)) +
  geom_line(lwd=0.1, alpha=0.5) +
  geom_line(stat = "smooth", method = "loess",
            aes(group = continent, color = continent)) +
  xlab("Year") + ylab("Life Expectancy") +
  ggtitle("Life expectancy over time")  +
  theme_bw(base_size = 8) + 
  facet_grid(cols = vars(continent)) +
  theme(legend.position = "none")

  • This plot won’t be displayed until you type in the object name
  • You can also take the object and add more layers!!

Making a ggplot: stored ggplots

… show the stored plot

my.first.plot
## `geom_smooth()` using formula 'y ~ x'

Making a ggplot: stored ggplots

… adding layers to a stored plot

my.first.plot + theme(legend.position = "bottom")
## `geom_smooth()` using formula 'y ~ x'
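Storing a plot and extending it later can be sketched as follows. This example swaps in the built-in mtcars data (an assumption, so the block runs without the gapminder package); the object names are hypothetical.

```r
library(ggplot2)

# Store a base plot; nothing is drawn until the object is printed
base.plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Adding to a stored plot returns a new object; base.plot itself is unchanged
with.trend <- base.plot +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Themes and other options can be added the same way
with.theme <- with.trend +
  theme_bw(base_size = 14)

# Inspect the two layers of the extended plot
points.layer <- layer_data(with.trend, i = 1)
trend.layer  <- layer_data(with.trend, i = 2)
print(nrow(points.layer))
```

Typing with.theme (or calling print(with.theme)) would display the final figure.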

In-class exercise

Explore other relationships in the gapminder data using what you learned today. You could consider other variables in the data set, use an alternative geometry, or facet by other variables. Make just one figure, but use as many of the concepts you learned as possible.

This data visualization cheat sheet might be helpful

Summary from previous lecture

The base layer ggplot

  • ggplot() initializes a ggplot object

  • declares the input data and global aesthetics

  • add layers by using the + operator

The geometry layer geom

geom_[some geom](mapping = NULL, data = NULL, stat, ...)

  • mapping: list of aesthetic assignments, created with aes(), for this geom

  • stat: statistical transformation applied to the data for this geom

  • NULL: the default for mapping and data, meaning the layer inherits them from ggplot()

  • ...: other arguments, often aesthetics you want to set regardless of the data, e.g. color = "green"

Aesthetic Mappings

Besides the x- and y-position, variables can be mapped to other geom aesthetics

Examples:

geom_point(aes(x=year, 
               y=lifeExp, 
               size = pop ))#: point size varies with `pop`
aes(..., color = continent)#: color varies with `continent`
aes(..., fill = continent)#: fill color varies with `continent`
aes(..., linetype = country)#: linetype varies with `country`

Anatomy of a ggplot

ggplot(data = [dataframe], 
       mapping = aes(x = [var_x], y = [var_y], 
                     color = [var_for_color], 
                     fill = [var_for_fill], 
                     shape = [var_for_shape])) +
  geom_[some_geom]([geom_arguments],
                   stat = [stat_transf],
                   position = [pos_adjust]) +
  ... + # other geometries
  facet_[some_facet]([formula]) +
  xlab([an x label]) + ylab([a y label]) +
  ggtitle(label = [a title], subtitle = [a subtitle]) +
  scale_[some_axis]_[some_scale]([scale_arguments]) +
  ... # other options
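One possible way to fill in the template, as a sketch. It uses the built-in mtcars data (an assumption chosen so the block is self-contained); the labels and the choice of facet and scale are illustrative only.

```r
library(ggplot2)

p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2, alpha = 0.7) +              # geometry, with two settings
  facet_wrap(~ am) +                               # one panel per transmission type
  xlab("Weight (1000 lbs)") + ylab("Miles per gallon") +
  ggtitle(label = "Fuel economy by weight",
          subtitle = "Panels: transmission type") +
  scale_x_log10()                                  # log10 scale on the x axis

# The built data holds the transformed x values (log10 of wt)
built <- layer_data(p)
print(head(built$x))
```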

Other geometries

Movies data

  • The data set is comprised of 651 randomly sampled movies produced and released before 2016.

  • Data come from IMDB and Rotten Tomatoes.

  • The codebook is available here.

Movies data

movies = readr::read_csv("data/movies.csv") 
movies
## # A tibble: 651 x 32
##    title title_type genre runtime mpaa_rating studio thtr_rel_year
##    <chr> <chr>      <chr>   <dbl> <chr>       <chr>          <dbl>
##  1 Fill… Feature F… Drama      80 R           Indom…          2013
##  2 The … Feature F… Drama     101 PG-13       Warne…          2001
##  3 Wait… Feature F… Come…      84 R           Sony …          1996
##  4 The … Feature F… Drama     139 PG          Colum…          1993
##  5 Male… Feature F… Horr…      90 R           Ancho…          2004
##  6 Old … Documenta… Docu…      78 Unrated     Shcal…          2009
##  7 Lady… Feature F… Drama     142 PG-13       Param…          1986
##  8 Mad … Feature F… Drama      93 R           MGM/U…          1996
##  9 Beau… Documenta… Docu…      88 Unrated     Indep…          2012
## 10 The … Feature F… Drama     119 Unrated     IFC F…          2012
## # … with 641 more rows, and 25 more variables: thtr_rel_month <dbl>,
## #   thtr_rel_day <dbl>, dvd_rel_year <dbl>, dvd_rel_month <dbl>,
## #   dvd_rel_day <dbl>, imdb_rating <dbl>, imdb_num_votes <dbl>,
## #   critics_rating <chr>, critics_score <dbl>, audience_rating <chr>,
## #   audience_score <dbl>, best_pic_nom <chr>, best_pic_win <chr>,
## #   best_actor_win <chr>, best_actress_win <chr>, best_dir_win <chr>,
## #   top200_box <chr>, director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## #   actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>

Histograms

ggplot(data = movies, aes(x = audience_score)) +
  geom_histogram(binwidth = 5)

Boxplots

ggplot(data = movies, aes(y = audience_score, x = genre)) +
  geom_boxplot()

Terrible x-axis labels

Boxplots - axis formatting

ggplot(data = movies, aes(y = audience_score, x = genre)) +
  geom_boxplot() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))

Fixed using the axis.text.x option in theme

Density plots

ggplot(data = movies, aes(x = runtime)) +
  geom_density() 

Smoothing - loess (the default for fewer than 1,000 observations)

ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.5) +
  geom_smooth()

Smoothing - lm

ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
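To see what method = "lm" draws, the sketch below compares the smooth layer with a line fitted by hand using lm(). The data are simulated (an assumption, since movies.csv is not bundled here), and formula = y ~ x is passed explicitly to avoid the informational message.

```r
library(ggplot2)

# Simulated data: a noisy straight line (made up for this sketch)
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 2 * toy$x + rnorm(30)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smooth; its y values are the fitted line on a grid of x values
smooth.data <- layer_data(p, i = 2)

# Fit the same model by hand and predict at the same x values
fit      <- lm(y ~ x, data = toy)
by.hand  <- predict(fit, newdata = data.frame(x = smooth.data$x))
max.diff <- max(abs(smooth.data$y - by.hand))
print(max.diff)
```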

Barplots

ggplot(data = movies, aes(x = genre)) +
  geom_bar() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))

More aesthetics

Density plots - border color

ggplot(data = movies, aes(x = runtime, color = audience_rating)) +
  geom_density() 

Density plots - fill color

ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density() 

Density plots - fill color, with alpha

ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density(alpha = 0.5) 

Segmented barplots

ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))

Segmented barplots - proportions

ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar(position = "fill") + ylab("proportions") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))

Dodged barplots

ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))
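The three position options for bars can be compared numerically. This sketch uses a tiny made-up data frame (the genre and rating values are hypothetical) and inspects the bar coordinates that ggplot2 computes.

```r
library(ggplot2)

# Five hypothetical movies (made-up categories)
toy <- data.frame(
  genre  = c("Drama", "Drama", "Drama", "Comedy", "Comedy"),
  rating = c("high", "high", "low", "high", "low")
)

base.plot <- ggplot(toy, aes(x = genre, fill = rating))

stacked <- layer_data(base.plot + geom_bar())                    # default: "stack"
filled  <- layer_data(base.plot + geom_bar(position = "fill"))   # proportions
dodged  <- layer_data(base.plot + geom_bar(position = "dodge"))  # side by side

# Height of the tallest segment top within each genre
top.stacked <- tapply(stacked$ymax, as.numeric(stacked$x), max)
top.filled  <- tapply(filled$ymax,  as.numeric(filled$x),  max)
print(top.stacked)  # total counts per genre
print(top.filled)   # always 1, because bars are rescaled to proportions

# Dodged bars share the slot of one stacked bar, so each is narrower
dodge.width <- max(dodged$xmax - dodged$xmin)
stack.width <- min(stacked$xmax - stacked$xmin)
```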

Scales

Scales

A scale is a realization of data values in terms of aesthetic (physical) values

  • controls the mapping of data (domain) to aesthetics (range)
  • each aesthetic has its own (default) scale
  • the scale depends on the variable type:
    • discrete (factor, logical, character)
    • continuous (numeric)

Scales

Scale specifications have the form

scale_AESTHETIC_SCALENAME()


  • AESTHETIC: x, y, color, fill, linetype, size, or shape

  • SCALENAME: grey, gradient, hue, manual, continuous, etc.
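The naming pattern can be illustrated with two scales on a made-up data frame (values and group labels are hypothetical): scale_color_manual() is AESTHETIC = color with SCALENAME = manual, and scale_y_continuous() is AESTHETIC = y with SCALENAME = continuous.

```r
library(ggplot2)

# Four made-up points in two groups
toy <- data.frame(
  x   = c(1, 2, 3, 4),
  y   = c(2, 4, 3, 5),
  grp = c("a", "a", "b", "b")
)

p <- ggplot(toy, aes(x = x, y = y, color = grp)) +
  geom_point() +
  # discrete colour scale: map each level to a chosen colour
  scale_color_manual(values = c(a = "#4B9CD3", b = "#001A57")) +
  # continuous position scale: set axis limits and tick positions
  scale_y_continuous(limits = c(0, 10), breaks = seq(0, 10, 2))

# The colours actually assigned to the points
point.colours <- layer_data(p)$colour
print(unique(point.colours))
```

Naming the elements of values ties each colour to a specific level, which is safer than relying on the order of the levels.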

Axis scales

ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_sqrt()

Axis scales

ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(trans="identity", breaks=seq(10,100,10), limits=c(1,100)) +
  scale_y_continuous(trans="identity", breaks=c(1,20,50,100), limits=c(1,100))

Some color scales: Viridis

ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) +
  geom_point() +
  scale_color_viridis_d()

Some color scales: Brewer

Color Brewer

ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) +
  geom_point() +
  scale_color_brewer(palette = "Accent")

Scales again

ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density(alpha = 0.5) +
  scale_x_log10()

Scales again

ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density() +
  scale_x_log10() + 
  scale_fill_manual(values=c("#4B9CD3","#001A57"))

In-class Exercise: Recreate this plot

Statistical Transformations

Statistical Transformations

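geom_bar(mapping = NULL, 
         data = NULL, 
         stat = "count", 
         position = "stack", ...)

  • stat statistically transforms the input data: the default "count" tallies the rows at each x value, while "bin" first bins a continuous x and then counts

  • position: "dodge" gives side-by-side bars, "stack" gives stacked (additive) bars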

Statistical Transformations

ggplot(data = movies, aes(x = audience_score)) + geom_bar(stat="bin") 

Statistical Transformations

ggplot(data = movies, aes(x = audience_score)) + 
  geom_bar(stat="bin", binwidth = 20) 

binwidth specifies the width of each bin (use bins to set the number of bins)
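A quick check of the two binning arguments, as a sketch on made-up scores from 1 to 100: binwidth fixes how wide each bin is, whereas bins fixes how many bins there are.

```r
library(ggplot2)

# Made-up scores: the integers 1 to 100
toy <- data.frame(score = 1:100)

# Fix the width of each bin
by.width <- layer_data(
  ggplot(toy, aes(x = score)) + geom_histogram(binwidth = 20)
)

# Fix the number of bins instead
by.count <- layer_data(
  ggplot(toy, aes(x = score)) + geom_histogram(bins = 5)
)

# Each row of the built data is one bar, with its left and right edges
bar.widths <- by.width$xmax - by.width$xmin
print(bar.widths)
print(nrow(by.count))
```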

Let’s do an exercise together

  • Do the transformation stat=“bin” by hand

  • Cut audience_score into groups (0,20] (20,40] (40,60] (60,80] (80,100] with

aud_scorecut <- cut(movies$audience_score, breaks=seq(0,100,20))

  • Count observations by group and save with

count.df <- as.data.frame(table(aud_scorecut))

  • Use ggplot() + geom_bar()

Question: what stat argument do you need?

Let’s do an exercise together

aud_scorecut <- cut(movies$audience_score, breaks=seq(0,100,20))
count.df <- as.data.frame(table(aud_scorecut))
count.df
##   aud_scorecut Freq
## 1       (0,20]   13
## 2      (20,40]  104
## 3      (40,60]  164
## 4      (60,80]  217
## 5     (80,100]  153

Let’s do an exercise together

No good, why?

ggplot(count.df, aes(x=aud_scorecut, y=Freq)) + 
  geom_bar(stat="bin")  
## Error: stat_bin() can only have an x or y aesthetic.

Let’s do an exercise together

Ok, so how about this?

ggplot(count.df, aes(x=aud_scorecut)) + 
  geom_bar(stat="bin")  
## Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?

Let’s do an exercise together

ggplot(count.df, aes(x=aud_scorecut, y=Freq)) + 
  geom_bar(stat="identity")  
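The whole exercise can be reproduced end to end on a few made-up scores (hypothetical values, so the block runs without movies.csv). It also shows geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Eleven made-up audience scores
scores <- c(5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 100)

# Bin by hand, then count the observations per bin
score.cut <- cut(scores, breaks = seq(0, 100, 20))
count.df  <- as.data.frame(table(score.cut))
print(count.df)

# The counts are already computed, so ggplot2 must not transform them again
p.identity <- ggplot(count.df, aes(x = score.cut, y = Freq)) +
  geom_bar(stat = "identity")

# Equivalent shorthand
p.col <- ggplot(count.df, aes(x = score.cut, y = Freq)) +
  geom_col()

# Bar heights in both plots equal the precomputed counts
heights.identity <- layer_data(p.identity)$y
heights.col      <- layer_data(p.col)$y
```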

Plug for possible extra topic: Shiny

Shiny

  • A web application framework for R with which you can easily turn your analyses into interactive web applications

  • No HTML, CSS, or JavaScript knowledge required

Live demo

Acknowledgements